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Abstract. On-line data assimilation techniques such as ensemble Kalman filters and particle filters tend to loose accu- 
racy dramatically when presented with an unlikely observation. Such an observation may be caused by an unusually large 
measurement error or reflect a rare fluctuation in the dynamics of the system. Over a long enough span of time it becomes 
likely that one or several of these events will occur. In some cases they are signatures of the most interesting features of 
the underlying system and their prediction becomes the primary focus of the data assimilation procedure. The Kuroshio 
current that runs along the eastern coast of Japan is an example of just such a system. It undergoes infrequent but dramatic 
changes of state between a small meander during which the cuixent remains close to the coast of Japan, and a large meander 
during which the cun'ent bulges away from the coast. Because of the important role that the Kuroshio plays in distributing 
heat and salinity in the sun'ounding region, prediction of these transitions is of acute interest. Here we propose several data 
assimilation strategies capable of efficiently handling rare events such as the transitions of the Kuroshio current in situations 
where both the stochastic forcing on the system and the observational noise are small. In this regime, large deviation theory 
can be used to understand why standard filtering methods fail and guide the design of the more effective data assimilation 
techniques suggested here. These techniques are tested on the Kuroshio and shown to perform much better than standard 
filtering methods. 



1. Introduction 

The assimilation of noisy observations into a model to improve its predictive capabilities is a recurring challenge 
in many applications. Examples include weather prediction and forecasting, robot tracking, stochastic volatility esti- 
mation, image analysis, etc. (see |8 1). In these applications and many others one is interested in predicting how the 
system evolves in time given a model for its dynamics and sequentially available, incomplete observations of its state. 
For practical reasons it is desirable to assimilate the observations in real time ("on-Une") via a recursive algorithm that 
requires only the latest observation together with the previous estimate of the system's state. In the simplest case of 
Gaussian (i.e. linear) evolution and Gaussian observations, a solution to this problem is given by the Kalman filter 
(see ifTSll ) and extensions thereof (see 10111251 [T6l ). which predict how the mean and the variance of the system's state 
evolve given the observations. For problems with significantly non-Gaussian features, Kalman filters are unsuitable 
(see |(2l|2T]|20l and the numerical results in Section 5|5.1 1, and particle filters, first suggested in [14] and 1 19|, can be 



used instead. Particle filters, also known as sequential Monte Carlo methods, are recursive assimilation algorithms that 
predict how the posterior distribution of the system's state evolves given the observations. They can in principle be 
applied to general, nonlinear, non-Gaussian situations, though they can be impractical in high dimensional problems 
(see lIU). An additional problem that both Kalman filters and particle filters share is that they tend to fail when the 
system undergoes occasional, unusually large transitions revealed by an observation that is inconsistent with the pre- 
dictive distribution of the system's state. Over long periods of time such transitions are inevitable and in some systems 
they are precisely the events of main interest. Our aim here is to develop filtering techniques capable of handling such 
situations. 

An interesting example of a system exhibiting very occasional but interesting transitions that present significant 
challenges to standard data assimilation strategies is the Kuroshio current running along the eastern coast of Japan. 
The Kuroshio current exhibits transitions between a small meander during which the current remains close to the 
coast of Japan, and a large meander during which the current bulges away from the coast (see Figures [T] and |2]i. The 
Kuroshio's central role in distributing heat and salinity in the surrounding region has led to many studies of its bimodal 
behavior beginning with a study by Yoshida in 1961 (see 1341 ). The meanders typically persist for 5-10 years, while 
the transitions between them occur in only a few months. Here we focus on a simple model qualitatively capturing the 
bimodal behavior of the Kuroshio as a test bed for our new filtering strategies, and we show that they remain effective 
in a regime where standard filtering techniques fail dramatically. Our general statements and results apply to systems 
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Figure 1 . Paths in the small meander state. (Reproduced from ll26l . Originally adapted from li28J .) 



exhibiting less dramatic and less interesting rare events, as these also occur eventually and lead to a loss of accuracy 
in standard filtering techniques. 

We shall focus primarily on situations where the system is forced by a small stochastic term and the observational 
noise is small, as this regime is the most challenging for standard filters. In this low noise, accurate observation regime, 
the behavior of rare events such as the transitions of the Kuroshio can be understood within the framework of large 
deviations theory (see [TO] [T3] ^0|). This theory is built on the key observation that, when a rare event occurs, 
it typically does so in a predicable way by the most likely path possible. While the event is rare and the likelihood 
of observing this path is small, the likelihood of observing any other path is much smaller. The identification of the 
most likely path is at the core of a family of data assimilation techniques, called variational methods (e.g. 3DVar and 
4DVar), that do effectively handle non-linearities in the underlying system (see [27 1). These methods do not involve 
any sampling. Instead they find the solution to a large optimization problem which gives either the most likely current 
state of the system given the observations or the most likely path of the system over several observational windows. 
Unfortunately these methods provide only an estimate of the most likely trajectory of the system given the observations 
and do not provide further information about the distribution of the system. Moreover, they are not directly suitable 
for on-line data assimilation. 

Several authors have suggested combining sampling methods with variational methods (see e.g. 1331 and references 
therein). The framework we suggest in this paper follows this line of thought and naturally leads to hybrid data assim- 
ilation techniques sharing characteristics of both sampling based filters and variational methods. More specifically, 
we use information about the most likely path for the event to do importance sampling within the particle filters. This 
approach not only leads to filters that remain accurate in the presence of rare events, but it also allows one to predict 
the behavior of new and existing data assimilation schemes. We demonstrate that, in the small noise regime that is our 
focus, standard schemes can be expected to behave very poorly when the underlying system undergoes a rare large 
change in its state. The framework and methods suggested here should also shed some light on interesting and related 
methods such as path sampling based techniques (see ||Tl|2l|32l) and implicit sampling (see QIS). 
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Figure 2. Paths in the large meander state. (Reproduced from ||26l . Originally adapted from ||28l .) 



The remainder of this paper is organized as follows. In Section[2]we describe the filtering problem along with the 
most basic versions of the Kalman filter (Section 2|2.1 1 and a particle filter (Section 2|2.2 1. In Sectionj3]we discuss 
the difficulties that these filters encounter in the small noise, accurate observation regime. In Section B] we propose 
several hybrid sampling strategies. Section |5] presents the results of our numerical experiments on a simple model 
of the Kuroshio. Some conclusions are given in Section [6] Finally, in Appendices A and B we give details of some 
aspects of the numerical implementation of our algorithms. 



2. Standard filtering methods 

The goal of any discrete-time filtering algorithm is to approximately reconstruct the trajectory of some time- 
dependent process X{t) from observations Hi, H2, ... taken at a discrete set of times ti, t2, ■■■ The observations 
are typically incomplete and made with measurement error, i.e. they are of the form 

i/„ = G(X(t„)) 

where the G(x) is a random function modeling noisy observations of some part of the system. We will assume that, 
conditioned on X(<„) ~ Xn, the observations, i/„, admit a probability density function 

Pih„\x„) 

where here and below we follow standard practice and label random variables with upper case letters (e.g. X{tn), Hn, 
etc.) and the values assumed by those random numbers with lower case letters (e.g. a;„, /i„, etc.). The underlying 
process X{t) should be considered "hidden" and revealed only through the observations. Ideally one would like to 
calculate modes and moments of the conditional distribution of the hidden signal X{t) given these observations. For 
example, one may wish to approximate the expectation of f{X{tn)) conditional on Hi — hi, Hn = hn, 

(!) E[/(X(t„))|/ii,...,/i„] 
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When the underlying process X{t) is Markovian there are several recursive approaches to the approximation of 
quantities such as Q. These methods are based on the relationship 

/ I, , N P{hn\Xn) J PiXn\Xn-l)p(Xn-l\hi,. . . ,hn^i) dXn^i 

(2) P [Xn \hi,...,hn) = 77— TV T ^ 

P[hn\hi,. . . ,hn-l) 

which follows from the Markov property of X{t) and Bayes' Formula. Here p{xn \ hi, . . . , hn), denotes the probabil- 
ity density function of X{tn) conditional on Hi ~ hi, _ff„ = /i„, and will be referred to as the posterior density, 
and p{xn I Xn~i), denotes the density of X{tn) conditional on X{tn-i) — Xn-i and will be referred to as the predic- 
tive density. The interpretation of equation (|2]) is simple: it states that to obtain the posterior density at time tn from 
the posterior density at time one can first integrate the system from time to time tn with initial conditions 
drawn from the posterior density at time i„_i and then discount the resulting samples by a weight proportional to 
p {hn\xn) ■ In fact, if X{tn-i) is assumed to be drawn fromp (.t„_i | hi, . . . , /i„_i) , then 

E [/(X(t„)) \hi,..., = Eb(/.„|X(f„))] 

There are several methods by which one might hope to approximately carry out the recursion Perhaps the 
most obvious approach is to simply compute the integral using some quadrature scheme. This approach suffers from 
two insurmountable difficulties. The first is that there is often no closed form expression for the predictive density 
p{xn I x„^i). The second is that numerical quadrature becomes computationally impractical in more than a few di- 
mensions. Two popular approximation methods that have been applied successfully in various settings are Kalman 
and particle filters. The next two subsections briefly describe these two approaches. 

2. 1. Kalman filter. In the case that logp(/i„ | a;„) is a quadratic function of x„ and that both the initial density and the 
predictive density p{xn \ Xn-i) are Gaussian (which is true if the evolution of X{t) is governed by a linear equation), 
it is easy to see from (|2]i that the posterior density p | /ii, . . . , /i„) is also Gaussian. In this case to completely 
describe the posterior density one needs only find formulas for its mean and variance, which are obtained by easy 
manipulations of p(/i„ | x„), p{xn \ Xn^i), and the initial density. In fact, from (|2]i one can derive recursive formulas 
for the mean and the variance of p (x„ \hi, . . . , hn) ■ The resulting recursive scheme is called the Kalman filter (see 
I18J). 

There are several important derivatives of the Kalman filter. These methods use Gaussian approximations of the 
densities appearing in (|2]) (p(/i„ | Xn), p{xn \ Xn-i), and the initial condition) and are accurate in regimes in which 
these densities are nearly Gaussian. A particularly effective variant, the ensemble Kalman filter (see [llj), proceeds 
from an estimate of the mean m„_i and covariance E„_i of p [xn-i \hi, . . . , hn-i) as follows 

Algorithm 1 (Basic ensemble Kalman filter). 

(1) Generate M independent Gaussian vectors, {Xj(i„_i)}^-|^ , with mean 

m„_i and covariance Sn-i. 

(2) For each j, evolve Xj{tn-i) from time t„_i to time tn to generate an 
independent sample Xj{tn) from p(xn\ Xj{tn-i)). Note that, in general, 
these samples are no longer Gaussian. 

( ~ 1 *^ 

(3) Compute the sample mean 77i„ and sample covariance I]„ of <,Xj{t„)> 

(4) Analytically compute the mean rhn and covariance of the posterior 
Gaussian density 

C'-^pihn I a;„)e-5(^"-™")'s-^(x,.-m„)_ 
where C is a normalization factor. 

2.2. Particle filter. Particle filtering is a very general sequential importance sampling implementation of the recursion 
in (|2| (see IT4l|l9ll23ll22l l8l). Assuming that one has samples approximately generated from the posterior density at 
time tn-i, one first evolves the samples forward to time i„ (as in the ensemble Kalman filter). At this point one has 
samples approximately drawn from 

p{Xn \hi, . . .,hn^l) = / l)p {xn-1 \ hi,..., hn-l) dXn-1- 



One then assigns to a sample at position Xn the importance weight | proportional to 

p {x„ \ hi,.. . ,/i„) 

p{Xn \hi, . . . ,/l„_l) 

(the proportionality follows from (|2|). 

At the next observation time these steps are repeated. The sample weights from the previous observation time will 
be multiplied by weights corresponding to the new observation. Over several observation times this leads to highly 
degenerate weights and poor statistical properties of the resulting estimator To avoid this problem the standard particle 
filter includes a simple resampling step. The full procedure is summarized in the following algorithm (see lfT4l ): 

Algorithm 2 (Basic particle filter). 

(1) Begin with M weighted samples {{Xj{tn-i),Wj{tn-i))}^^ of 

p{Xn-l \ hi,...,hn^i). 

(2) For each j, evolve Xj(t„_i) from time <„_i to time tn to generate an 
independent sample Xj{tn) from p{xn\Xj{tn-i)). 

(3) Evaluate the weights, 

Wj{t^) = p{K I Xj{t^)) Wj{ta-l). 

(4) Resample the particles if necessary to obtain a weighted ensemble 
{{Xj{tn),Wj{tn))}f^-y approximating p{xn\hi, . . . ,hn). 

3. The small noise accurate observation regime 

Assume for the moment that in the particle filter, one resamples at each step so that Wj{tn) = 1/M. Notice that 
Step 2 is identical in both Algorithms [T] and |2] At the end of Step 2 both the ensemble Kalman filter and the particle 
filter have generated an empirical density, 

1 ^' 

(3) ^E'^(^"-^^(^")) 

approximating p(x„ | /ii , . . . , /i„_ i ) . In the ensemble Kalman filter this empirical density is approximated by a Gauss- 
ian density allowing the information from the observation at time i„ to be incorporated analytically. The particle filter 
uses the importance weights to transform (|3]l into a weighted empirical density approximating p(x„ . . . , /i„). 

The difficulty with both methods is that the empirical density ([3]) provides a very poor approximation of the tails 
of the density p{xn • ■ • , hn-i). Of course most observations (i.e. values of hn) do not correspond to tail events 
of p{xn I'll, • • • , hn-i). However, in many cases (e.g. the Kuroshio current), these tail events are precisely the most 
interesting features of the system and the events that one is primarily interested in capturing. 

The practical consequence of these tail events is easily understood by considering the behavior of a particle filter on 
the Kuroshio current. Suppose that between two observations the hidden signal makes a transition from one meander to 
another. If at the time of the first observation, all samples are in the small meander, then at best only a few trajectories 
will make the transition to the large meander. Therefore at the time of the next observation, when new likelihood 
weights are calculated, almost all of the samples will receive negligible weight, resulting in an estimator with very low 
accuracy. As we will see in the next section the way to overcome this problem is to use rare event sampling methods 
to bias the evolution of samples toward regions of space where the observation weights are large. In order to introduce 
these methods, let us first formaUze the discussion on rare events: indeed many methods have been proposed in the 
past to deal with these events (see f8','7','6l) but the issues they pose do not seem to have been identified precisely. 

To this end consider the situation in which the dynamics are governed by the stochastic differential equation 

X{t) = b{X{t)) + y^a{X{t)) B{t) 

(4) X{0) - xo 

where B is Gaussian white noise. Assume also that the observations admit a conditional density of the form 

(5) p(/i„|x„)oce-i»(''"'^"). 

In both equations e is a small parameter. In this setup both the stochastic forcing in the dynamics and the observation 
noise are small. 
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To make the case for rare event simulation tools in data assimilation problems we will focus on the particle filter 
However, as our numerical tests will highlight, the difficulties presented by rare events are shared by the Kalman filter 
family of methods as well. Consider the assimilation of a single observation Hi = hi at time <i starting from initial 
condition X{Q) — x^An particular consider the weights Wj generated in Step 3 of Algorithmji] When these weights 
are constant the samples Xj{ti) generated in Step 2 are exactly samples from the posterior density p{xi \hi). The 
variation in the weights is a measure of the distance of the empirical density (|3]l from the posterior density. It is natural 
then to consider the behavior of the relative standard deviation of the weights. 
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Hi 


= hi 


) \Hi=hi 
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\Hi 


= hi 





(6) 



It would be perfectly reasonable to omit the conditioning on Hi in the previous expression. However, we will be mod- 
ifying the dynamics of our system between observations (here between time and ti) based on the next observation 
(here at time <i) and therefore we must assume that the next observation is available. The weights at observation time 
ti are identically distributed so the subscript j in the previous display can be ignored. Expression (|6]l can be rewritten 

as 



(7) 

where 
(8) 



p = VR-1 



R 



E [Wf 



E 



,-2g(;ii,X(ti)) 



E 



43(hi,X(ti)) 



Note that R > 1 since the expectation of the square of a random variable is always greater than the square of its 
expectation. If i? = 1 the statistical error vanishes completely. By dividing the number of samples by i? — 1 one 
obtains the so called effective sample size, a rule of thumb giving the number of unweighted samples that would be 
equivalent (in terms of statistical accuracy) to an ensemble of weighted samples (see for example [8J). 

Let us estimate R in the present context. The Laplace Principle (see e.g. |l9][T0][30l) indicates that, under suitable 
assumptions. 



(9) 



E 



-i3(hi,X(ti)) 



-7i/£ 



where two functions of e are asymptotically equivalent (x) if the ratio of their logarithms converges to 1 as e 
The constant 71 is given by the formula 

71= inf {Ii^)+g{hi,ip{ti))} 
where the infimum is taken over all absolutely continuous functions, (p, from [0, ti] into M'^ and 



(10) 



In our context this simply says that in the small noise regime the posterior density is sharply peaked around the states 
that are most likely (in the small noise limit) given the observations. 
Applying the Laplace Principle one more time we obtain 



(11) 
where 

Notice that 
(12) 



E 



hih,X(ti)) 



-72/e 



72= inf {Ii^) + 2g{hi,^iti))}. 



72= inf {I{p) + 2g{hi,p{ti))} < inf {21 (p) + 2g {hi,^{ti))} ^ 2-fi. 

•fi(0)=xo ip{0)=xo 



In most cases the inequahty in ( [T2| l is strict and inserting (j9]) and 
0: 



hi other words, the number of samples needed to maintain accuracy increases exponentially as e — > 0. This is the 
primary challenge posed by rare events. 

Before closing this section we should remark that while the finite time horizon setting (finite ti) described here 
seems appropriate for designing on-line filtering methods, it is not the appropriate setting for studying the transition 
mechanism of the system itself. This transition has a natural time scale (which probably has nothing to do with the 
observation window <i) on which it occurs. Methods for analyzing such events can be found in ifTSl . 



1 1 1 in ([8]) implies that R increases exponentially as 



4. Importance sampling strategies 

The difficulty discussed in the previous section can be remedied within the context of importance sampling as 
we now describe. Recall our assumption that the original dynamics of the system are governed by the stochastic 
differential equation Q 

X{t) = h{X{t)) + ^ea{X{t)) B{t) 
X{Q) = xo 

We now modify these dynamics by adding an additional forcing term v{t, x) so that instead of evolving (|4]) we evolve 

(13) X{t)^h{X{t)) + a{X{t))v{t,X{t)) + ^~ea{X{t))B{t) 

X{0)^xo. 

Girsanov's Theorem (see e.g. ifTTl ') tells us that, for reasonable choices of v, the ratio of the probability of a path over 
the interval [0, t] under the original dynamics to a path under the modified dynamics ( [T3| ) is given by 

dQ 

Thus we can compute any expectation with respect to the original dynamics by instead computing a weighted expec- 
tation over the modified dynamics, i.e. 



E\f{X)] = I /(ijdPM = J f{x&x}dQ{x) = E 



where in this formal expression P is the measure of a path under the original dynamics and Q is the measure of a path 
under the dynamics biased by v. In particular. Steps ii and iii of Algorithm|2]can be replaced by 

ii' For each j, generate an independent sample Xj of the solution to the 
modified stochastic differential equation 

Xjit) = b{Xj{t))+a{Xj{t))v{t,Xj{t)) + ^ea{Xj{t))B,{t), t e [i„-i,tn] 

Xj{tn-l) — Xj(tn-l). 

and 

iii' Evaluate the weights, 
where 



_ 1 r*n 

The question then becomes how to choose v in ([T3]l so that the ratio between the variance of the new weights Wj 
and their mean square remain small, and we avoid the problem discussed in the previous section. This question is 
discussed next. 
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4.1. The single observation case. Let us again briefly assume that the process X is observed only once at time ti 
and that Hi = hi. Our goal is to choose the function v so that the relative standard deviation of the weights in Step 
i i i ' above is as small as possible. This is not a trivial task. To see what it entails, note that the relative standard 
deviation of the new weights is 

where now (compare ([HJ) 



E 



(14) 



(15) 



R 



ii,X(ti)) rfP 
dQ 



E 



E 



-^g(hi,X(ti)) dP 
dQ 

^-^g{lH.X{t,))-^ f;,^v(s.X(s)).dB(s))~^ f;;^\v(s.X{s))\''ds 



E 



.ig(/ii,X(ti)) 



As we have explained before, to control the relative standard deviation of the weights one must control R. Interest- 
ingly, there is one choice of the biasing function v which results in no statistical error at all (i.e. such that R — 1). This 
choice will turn out to be impractical. However it will be useful to keep this "gold standard" in mind as we consider 
more practical possibilities. Specifically, if we define 

then it is possible to show that R = 1 for every e if we take v = with 
(16) ^- 



In other words the process X in ( pjj ) with v = v"^ samples exactly from the density of X given the observation at time 
ti. With this choice reweighting of the samples is unnecessary. The function $^ satisfies the backward Kolmogorov 
equation 

e 



(17) 



1 



with terminal condition ~ Q-kaihi-.x) ^ could, therefore, attempt to discretize and solve the partial dif- 

ferential equation ( [T7] l to approximate ([T6|. Unfortunately this strategy is impractical in more than a few dimensions. 
These ideas, however, can be modified and put to use in high dimensions. Notice that for each e > the function 

(18) G'^^-elog*"^ 

solves the second order Hamilton-Jacobi equation 

(19) 

with terminal condition G"^(ti, x) — g{hi, x), where 

(20) nix,p) = -{bix),p) 
In terms of G"^, the function v"^ in ( [T6] l can be written 

(21) v'^-a^D^G". 

It is natural to replace by its zero viscosity approximation G, i.e. by the viscosity solution to the first order 
Hamilton-Jacobi equation, 

(22) -dtG + 'H{x,D^G(t,x)) = 
with terminal condition G{ti,x) = g{hi, x), and to use the function 

(23) v°^~a^D^G 
in place of the choice in ( [T6| l. 



Direct solution of equation p2| ) is, again, not practical in more than a few dimensions. However, under mild 
assumptions (see 13] [121), the solution of this equation has an optimal control representation 



(24) G{t,x)^ inf {It,tA^)+9ihi,^{ti))} 
where the infimum is taken over all absolutely continuous functions on [i, ti] and 

Notice that the constant 71 in formula (|9]) is given by 71 ~ G(0, xq). 

Assume that at any point {t, x) there is one minimal trajectory in ([24]), i.e. a trajectory (ft^^ such that 

(25) G{t, x) = It^ti {•i>t,x) + 9 {hi,ipt,x{ti)) ■ 

Such a function cpt.x will be called the optimal control trajectory at (t, x). It can be shown that 

(26) v''{t,x) = c7-\x) (^iptAt) - K^)) ■ 

From expression ( |26] l it is clear that the choice ( |23] l requires that one solves the variational problem ( |24] i at each 
point along the path of X to find an optimal control trajectory at (s, X{s)) for each s e [0, ti]. Clearly, for problems 
in high dimensions solving ( |24] l is not a trivial task and carrying this out "on-the-fly" could impose a significant 
computational burden. In fact, the cost of generating each sample trajectory of ( [T3] l with this choice of v is quadratic 
in ti . However, when samphng sufficiently rare events this cost is more than made up for by the favorable statistical 
properties of the estimator corresponding to ( |23| ). The results in 1291 imply that for in ( |23] l, 

limi? = 1. 

e->-0 

This should be contrasted with the case of the standard particle filter for which we recall that R grows exponentially 
with e~^. Notwithstanding the results in ||29l . for the extremely high dimensional models common in climate and 
weather prediction it seems necessary to search for approximations which, while they will not match the statistical 
performance of the choice in ( |23] l, are less computationally burdensome. 

In the following we suggest two possible choices for v that in our numerical experiments seem to be inexpensive 
and effective. Both methods require solving the optimization problem (|24]) at least once. The optimization problem 
( p4j l is highly structured and in our numerical experiments the cost of the each solution of (|24]| is roughly ten times the 
cost of simulating a single trajectory of the original system Q. In fact ( |24] l is very similar to the variational problem 
at the heart of the 4DVar algorithm (see fTJ] and the references therein) and it is likely that algorithms already in use 
within the data assimilation community can be leveraged. 

One natural further approximation of the optimal choice v"^ is obtained by choosing a parameter < t < and, 
for k <ti /r, solving 

X{t) = h{X{t)) + a{X{t)) v{t) + ^a{X{t)) B{t) 

(27) i;(t) for t e [fcr, (A; + l)r]. 

Over each interval of length t we have simply replaced the spatial variable in by values along the optimal trajectory 
ip starting from the position of the sample X at the end of the previous length r interval. Note that on [fcr, (fc + l)r] 
we can write 

V{t) = (T'\'Pkr,X(kr)i^)) (<^/cr,X(fer) W " ^('^fcr,X(fcr) W)) ' 

For example, the choice t — ti yields 

v{t) = ti°(t, (^o,a;(i)) = Cr"^('^0,a;(i)) ('^0,x(t) ^ 6(c^o.a; (i))) ■ 
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4.2. Some simple recursive schemes for the multiple observation case. The picture that we have sketched changes 
sHghtly when assimilating multiple sequential observations. In general no recursive algorithm can completely avoid 
poor performance as e — > 0. To see this suppose that between each observation in Algorithm |2] (with Steps ii ' and 
iii' ) we use the biasing function v'^. In the single observation case the choice v"^ led to perfect sampling. In the 
multiple observation case however, after assimilating the first observation we are forced to use some approximation 
of the posterior density at time ti (an empirical approximation in the case of Algorithm |2]i. Unfortunately the states 
at time ti that are most consistent with the observation at time t2 may be far into tail of the time ti posterior density. 
Resolution in this tail will be very poor leading to large errors as e vanishes. 

One solution to this problem is to go back to time to and choose a function v that incorporates the observations at 
both ti and ^2 • In principle the states at time ti that are most relevant to any future observation (not just at ^2) may lie 
in the tail of the time ti posterior density and one might have to "backtrack" many observational steps each time a new 
observation arrives. Though potentially costly this backtracking is fairly easy to implement. However, our numerical 
results indicate that significant statistical improvements are possible without backtracking and we will not pursue the 
topic further here. 

The first algorithm that we suggest is a simple adaptation of the biasing function in ( |27] l to the particle filtering 
framework: 

Algorithm 3. 

(1) Begin with M unweighted samples {Xj{tn-i)}jLi approximately drawn from 

p{Xn-l |/ll,...,/).„-l). 

(2) For each j generate an independent sample Xj of the solution to ( |27| l on 
[tn-i,tn] starting from Label the corresponding trajectory of v 
in ([27| by Vj{t), te[t„_i,t„]. 

(3) Evaluate the weights, 

where 

Zjj — e 

(4) Resample the particles if necessary to obtain a weighted ensemble 
{{Xj{tn),Wj{tn))}f approximating p(a;„ | /ii, . . . , /i„). 

While fairly simple and, in our tests, effective. Algorithm |3] still requires that one solve the optimization problem 
( |24| i once for each particle at each observation. In practice this may not be necessary. The following ad-hoc algorithm 
solves the optimization problem only once at each step. The reader should note the similiarity with the ensemble 
Kalman filter, though as we will see, on some problems Algorithm|4]has significant advantages. 

Algorithm 4. 

(1) Begin with M weighted samples {(Xj(t„_i), Wj(t„_i))}j'£]^ approximating 

p[Xn-l \ hi,. . . ,hn-l). 

(2) Determine the sample mean m„_i and covariance of the {Xj{tn-i)}^Li- 

(3) Find the solution ip of the optimization problem 

I 1/2 |2 1 

infj -ft„_i,t„(v) + 2l^n-i {^{tn-i)-rnn-i)\ + g {K,ip{tn)) > 

(4) For each j, generate an independent sample Xj of the solution to 

X,{t) = h{X,{t))+a{X,{t))v{t) + ^ea{Xj{t))B{t), t G [tn-iM 

where Xj{tn-i) is a Gaussian random vector with mean (^(t„_i) and 
covariance Here 

v{t) = a-\m){m-hm)))- 
10 



(5) Compute the mean to„ and covariance by 

n^n = — i7 and Zj„ = — tj 

where 



and 



7 _ 



Notice that Algorithm |4] only requires the solution of one optimization problem per observation time. In our tests 
the cost of Step iii is only about 10 times the cost of the evolution of a single trajectory in Step iv, i.e. the cost 
of running Algorithm]?] is comparable to the cost of running Algorithm ]2] with 10 additional particles. Thus it seems 
that this algorithm is only marginally more expensive than the standard particle filter. The reader should appreciate 
that while this algorithm does assume that the posterior is Gaussian, it does not assume that the predictive density is 
Gaussian. This is the key to its advantage over the ensemble Kalman filter. For some problems this algorithm may 
benefit from the addition of a particle clustering step and a separate solution of the optimization problem in Step iii 
for each cluster (with different mean and covariance for each cluster). Finally we mention that while Algorithm ]3] 
employs a resampling step, one can easily imagine variants that do not. For example, one can modify these algorithms 
to incorporate the sample transformations in |4|. 

5. A SIMPLE MODEL OF THE KUROSHIO 

The filtering algorithms described in the previous sections will be tested on a stochastic perturbation of the barotropic 
vorticity equation introduced by |5| to model the Kuroshio current: 



(28) d,X + -{uX) + -{vX) + fy-j -^ju + f^^-^jv = r.AX + arj 

Here X{t, x, y) is the vorticity, r{x, y) is the water depth, f{x, y) is the Coriolis parameter, v is the horizontal eddy 
diffusivity, (u, v) are the velocities in the x and y directions, respectively, and 77 is a space-time white noise whose 
amplitude is controlled by a. 

In this model, as in Chao's original model, the coordinates x, y are rotated 20deg counter-clockwise from North- 
South, and (]28]l is solved in the domain D shown in Fig.]3] The viscosity and noise parameters are set to 

u = 8x 10"^ and ct = 6 x 10"^^ 

The Coriolis parameter is given by 

f = k + fxx + fyy 

where 

fx = f3 sin (20°) and fy^P cos (20°) 
and /3 and /g are given by /3 = 2 x 10^*^ and fo — 7 x 10^^. Unless otherwise specified, all distances are measured 
in kilometers and time is measured in seconds. The function r {x, y) is the water depth and away from the two bumps 
that model the Izu Ridge, it is set to the fixed value of 1 kilometer The northern bump is defined by 



rjv (x, y) = 0.5 cos 

for 



'tt ^(x - 1410)2 + (?/- 1020)2' 



2 90 



^(x - 1410)2 + (7/- 1020)2 < 90 

and the southern bump is defined by 



rs {x, y) 




Kyushu 
Wedge 



□ _270km 



720km 



840km 



(x-,y-) 



Izu Ridge 



2220km 



810km 



1 020km 



Figures. Model geometry. 



for 




where 

, (x - 1410) + (y- 780) 
X = ^ 

and 

, (x - 1410) - (y - 780) 
V2 ■ 

The horizontal velocities u and v satisfy 

{ru,rv) = (-1py,1pa:) 

where the volume transport streamfunction tp solves 

dx (r^^) ~^ dy (r'^^) 

The boundary conditions are 



I 




0, 


X 


= 0, 


at y = 0, 


II 




—33 Sv, 




= 0, 


along the northern boundary, 


III 




0, 




= 0, 


at a; = 0, 


IX 






'4^xx 


= 0, 


at a; = 2220 



where 

K{y) = for y< 870 

and 

K[y) = -33 Sv for y > 870. 

In the above formulas 'Sv' stands for 'Sverdrup' and 1 Sv represents a volume transport of 10^^ km^s^^. 

While ( |28| ) is far from a state of the art geophysical representation of the Kuroshio current, it is able to reproduce 
the large and small meanders of the current: in the derterministic model, these states are basins of attraction of the 
system forced by the Kyushu wedge and Izu ridge (see Figure [3]l. Both meanders coexist only for certain inflow 
volume conditions. For the deterministic model (a = 0) Chao demonstrated that it is possible to observe the transition 
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Figure 4. Small meander state of the model, (|28|. The domain is as depicted in Figure[3] 




Figure 5. Large meander state of the model, ( |28| l. The domain is as depicted in Figure |3] 



between meanders by varying the inflow condition. The stochastic modification ( |28] l of Chao's model was introduced 
by one of the authors in 131] and |32l where it was used to test a path sampling based filtering algorithm. 

In our calculations, we discretized (|28]l using a uniform 30 kilometer grid exactly as described in lf32l l3Tt and 
in Appendix A. The 2516 dimensional discrete time process so obtained (which we also denote by X) exhibits two 
meta-stable states that are qualitatively similar to the small and large meanders of the actual Kuroshio current. Figures 
|4] and |5] show typical states in both of these meanders. They were found by varying the northern boundary condition 
as in 151. The noise parameter in ( |28] l (a) was set to for the purposes of generating Figures]?] and |5] 

Let {x* ,y*) denote the point in D that is 990km from the western boundary and 860fcTO from the southern bound- 
ary. This point is pictured in Figure ]3] The bimodality of the system is evident in Figure ]6] which shows a long 
trajectory of the system projected onto the variable ,y*)- The state for which ip{x* , y*) w 15Sv roughly corre- 
sponds to the small meander and the state for which tp{x* ,y*) w ~20Sv roughly corresponds to the large meander 
As can also be seen in Figure ]6] the discrete stochastic system tends to remain in each of its meanders for roughly 10 
years. Transitions between the two meanders usually occur in a time span of a few months (though we again caution 
that this model is far from state of the art). 
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Figure 6 . Time series of approximation to -0 (x* , j/* ) where {x* ,y*) denotes the point in D that is 
990fcm from the western boundary and 860fcr7i from the southern boundary. Notice the transitions 
between a metastable state near 15Sv and one near —20Sv. 

Conditioned on the state X(i„), the observation at time i„, iJ„, is a Gaussian random variable with mean i}{tn)k*,m* 
and standard deviation 1.92918. Thus the discrete vorticity process X is observed through the value of the correspond- 
ing volume transport process ^/'(i„) at the single point (fc*, m*). The observation times are separated by 2.63 days. 



5.1. Test results. Using the setup described in the previous section we perform a simple but revealing test of the 
ability of the four filters discussed in this paper to track a single transition of the system from one meander to another 
All Algorithms begin from a single initial state in the small meander. The observations are plotted in Figures |7] and 
[9 12 and correspond to a segment of a long simulation of the system in which the system transitions from the small 



meander to the large meander We do not add noise to these observations. The resulting estimates produced by each 
of the four algorithms are displayed in Figure [7] Evidently Algorithms [3] and |4] are able to properly assimilate the 
change in the underlying state while the particle filter and the ensemble Kalman filter do not perform as well. The 
ensemble Kalman filter completely misses the transition while the particle filter seems to eventually capture it but 
remains relatively far from the estimates produced by Algorithms |3] and [4] 

While for a large enough number of samples the particle filter would give the correct result, it is important to 
observe that the results for the ensemble Kalman filter are fully converged. No matter how many particles are used 
in the ensemble Kalman filter it will not be able to effectively track the transition. This is because, as discussed 
earlier, the ensemble Kalman filter uses an extremely poor approximation of the tail of the predictive density, resulting 
in an estimate that is nearly arbitrary in problems highly sensitive to tail characteristics. The particular transition 
that we have used to define the observations is more rapid than most of the transitions that we observed in our long 
simulation of the system. On a slower transition the particle filter will show improved results. The ensemble Kalman 
filter performed poorly in all the cases that we examined. In our implementations of the ensemble Kalman filter and 
Algorithm [4] we make the assumption that the posterior covariance matrix ( E„) is diagonal though we do not assume 
that the Kalman gain matrix is diagonal in the calculation of to„ in the ensemble Kalman filter (through an application 
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Figure 7. Trajectory of estimate of ip{x* ,y*) given observation for the four algorithms outlined 
in this paper, {x* ,y*) is the point in D that is 990to from the western boundary and 860m from the 
southern boundary. The particle filter, the ensemble Kalman filter, and Algorithm [4] are all run with 
100 particles while Algorithm [3] is run with 10 particles. The estimates produced by Algorithms |3] 
and|4]track the signal more closely than the ensemble Kalman filter and the particle filter 



of the Sherman-Morrison formula). In the ensemble Kalman filter we also assume that the predictive covariance matrix 
(I]„) is diagonal. 

Only two of the schemes tested (Algorithm|3]and the particle filter) are statistically consistent in the sense that they 
exactly reproduce the posterior mean given a large enough sample size. For these two schemes we give the values of 
R (defined in (|8]l and ( [l4j i, respectively) in Figure |8] Recall that by dividing the number of samples by i? — 1 one 
obtains the so-called effective sample size, a rule of thumb giving the number of unweighted samples that would be 
equivalent (in terms of statistical accuracy) to an ensemble of weighted samples (see for example f8|). The results 
suggest that the ensemble generated by Algorithm |3] is of much higher statistical quality than the ensemble generated 
by the particle filter in the sense that to achieve the same accuracy as Algorithm [3] the particle filter would require as 
many as 100 times the number of samples. We do not compare similar statistics for the two more approximate schemes 
(the ensemble Kalman filter and Algorithm]?]). 

We run the particle filter and the ensemble Kalman filter with 100 particles each (M = 100). Algorithm ]3] is run 
with only 10 particles (M = 10). Algorithm]?] is mn with 100 particles (M = 100). In Algorithm ]3] we set r = 0.263 
days (see expression (]27]l). With these choices, the particle filter, the ensemble Kalman filter, and Algorithm ]4] have 
roughly equal cost. Algorithm ]3] is several times more expensive than the others. The optimizations in Algorithm 
]3] and ]4] typically converge to an acceptable level of accuracy in a handful of iterations (using a total of 10 Jacobian 
multiplications or less). For more details on how these optimizations are carried out see Appendix B. 

In Figures]9 - 12 we represent the empirical densities produced by the four algorithms. In each figure a vertical line is 
drawn at each sample value of 4> {x* , ) at every observation time. The height of each line is the weight of that sample. 
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Figure 8. The values of R defined in equations ([8]) (for the particle filter with 100 particles) and 
([T4]i (for Algorithm[3]with 10 particles). The effective sample size (see for example |8 |) for the two 
algorithms is found by dividing the number of particles by i? — 1. We plot the R values only for the 
two statistically consistent schemes (Algorithm |3] and the particle filter), that is the schemes that we 
know to converge to the correct density as the number of samples is increased. This plot suggests 
that the ensemble generated by Algorithm [3] is of much higher statistical quality than the ensemble 
generated by the particle filter in the sense that to achieve the same accuracy as Algorithm |3] the 
particle filter would require as many as 100 times the number of samples. It is not a meaningful 
statistic for approximate schemes such as Algorithm]?] or the ensemble Kalman filter. 

One can clearly see that the ensembles generated by the particle filter are dominated by a single particle at many of 
the observation times while the other schemes produce more regular weights. Again we can see that the ensembles 
generated by the particle filter and the ensemble Kalman filter are concentrated much farther from the observations 
than the ensembles generated by Algorithms |3] and ]4] Of course proximity to the observation is not a reliable measure 
of success in filtering (proximity to the posterior density is our goal). However, given that both Algorithms ]3] and ]4] 
agree, we conclude that they are accurate. 

6. Conclusion 

We have suggested that the failure of standard filtering algorithms such as ensemble Kalman filters and particle fil- 
ters on certain problems can be easily understood and analyzed in a low noise regime. As a first attempt at overcoming 
these issues we propose two schemes based on rare event simulation tools. The field of rare event simulation has seen 
explosive growth over the last decade or so and our investigation here is far from an exhaustive study of the potential 
utility of rare event tools within data assimilation. However, our results do strongly indicate that more investigation in 
this direction is warranted. 
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Algorithm 3 (10 particles) 




Figure 9. Plot of the values of tp{x* ,y*) corresponding to the samples Xj generated in Step ii 
of Algorithm [3] (with 10 particles) at each observation time. The z-coordinate of each point is the 
value of the weights (normalized) computed in Step iii of the algorithm. Most the 10 particles 
have weight comparable to the mean weight (in this case 0.1) and the particle values of ■ip{x* , y*) 
are clustered near the observation. 



In the context of the Kuroshio where standard methods fail, our new methods inspired by rare event simulation 
techniques are very effective. Of course, the application of these methods is not limited to the study of the Kuroshio 
current. Even restricting our attention to geophysics, one can imagine important applications such as the prediction 
of short and medium term extreme geophysical events like tsunamis, hurricanes, and drought, as well as paleoclimate 
data assimilation. We also remark that the hybrid nature of our schemes suggest that they could be combined with 
existing large scale data assimilation code employed in a black box fashion to design new robust algorithms. 

In our numerical test we have chosen to focus on a system with an obvious, interesting tail feature (the meander 
transition). While problems of this kind provide stark evidence that current filtering practice can be inadequate, it is 
important to point out that the same failures can occur on problems with more mundane tail behaviors. Indeed, over 
long periods of time one can expect nearly any underlying ("hidden") process to undergo rare excursions. In exactly 
the way that we have demonstrated, these excursions will cause standard methods to loose track of the signal. 

Finally we would like to point out that the methods proposed here are related to schemes proposed in |j7l |6l and 
called "implicit sampling" methods. In those schemes one attempts to construct a map taking an easily sampled 
random variable (usually a multidimensional Gaussian) to a random variable distributed according to the posterior 
density. The function v'^ defined in ([T6| is an exact solution to this problem. It can be used to map the trajectory of a 
Brownian motion (an easily sampled random variable) to a random variable (the corresponding X{T)) drawn exactly 
from the posterior density. In the low noise regime that we focus on, good statistical behavior does not require that 
the mapping be exact. For example, the theoretical results in [29] indicate that an asymptotic (in the small noise limit) 
approximation of {v'^ in ( |23| )) can yield strikingly good error behavior The importance sampling schemes at the 
core of Algorithms [3] and |4] are based on further approximations of v"^ and perform extremely well in our tests. 
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Algorithm 4 (100 particles) 
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Figure 10. Plot of the values of ip{x*,y*) corresponding to the samples Xj generated in Step 
iv of Algorithm [4] (with 100 particles) at each observation time. The z-coordinate of each point is 
the value of the weights (normalized) computed in Step v of the algorithm. All 100 particles have 
weight comparable to the mean weight (in this case 0.01) and the particle values of ip{x* ,y*) are 
clustered near the observation. 
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Appendix A: Discretized Kuroshio model 

Let Ax — 30km denote the spatial mesh size which is the same in both the x and y directions. For any function g 
on D define 



for ax, ay £ [—1,1]. Define the operators 



Sx9 = 



fj-xg 



.9/0+1/2, m ^ gk~l/2.m 
5/0+1/2, m + 5/0-1/2, m 



Syg = 



fj-yg 



9k,m+l/2 ~ 5/c,m-l/2 

a; ' 

.9i:,m+l/2 +5/o,m-l/2 



Particle Filter (100 particles) 




Figure 1 1 . Plot of the values of ip{x* , y* ) corresponding to the samples Xj generated in Step i i 
of the standard particle filter (with 100 particles) at each observation time. The z-coordinate of each 
point is the value of the weights (normalized) computed in Step i i i of the algorithm. At several 
steps there is only one particle with appreciable weight and that particle's value of ijj{x* ,y*) is far 
from the observations compared to the corresponding values generated by Algorithms [3] and |4] 



and 

(29) iO^^JM+A 



First, ( |28| l is discretized in space using a simple centered difference scheme which involves only values of X at points 
{kAr^, mAj.) . After replacing X by its restriction to these points the system becomes a set of ordinary stochastic 
differential equations, 

(30) ^fc,™(t) = Fk.^ {X{t)) + -^aBk.,m{t) 

where 



(31) Ffc,™ {X{t)) = -Dl (uu,,Xk,j) - Dl {vk,„rXk,„r) 

.ra J t<,.iii \ p ' ^ V \ I 

where ip solves 

(32) LV = X 
and 



Ensemble Kalman Filter (100 particles) 
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Figure 12. Plot of the values of ip{x* ,y*) corresponding to the samples Xj generated in Step ii 
of the ensemble Kalman filter (with 100 particles) at each observation time. The z-coordinate of 
each point is 0.01 (the weights in this scheme are constant). The particle values of iIj{x* , y*) are far 
from the observations compared to the corresponding values generated by Algorithms [3] and |4] 



h h 

The Bk.m are independent Brownian motions. Consistent with the previous sections, X{tn) denotes the solution of 
( |33| l at the time of kh observation. 

These stochastic ordinary differential equations cannot be solved explicitly and therefore require numerical solution. 
Here the spatially discretized system dSOll is discretized in time as 



+ {Fk,„iiXk,m) + Fk,„i{Xk^„i{nA))) y + ^^ai]k,m{n) 



(33) Xk,r,r{{n + 1)A) = Xfe,„(nA) 



where 

Xk.,n = Xk,rn{nA) + Fk,m {Xk.m{nA)) A + ^^^(T-qk,m{n) 

and now for each fc, m. and n, rjk.mi'n) is an independent Gaussian random variable with mean and variance 1. The 
resulting method is adequate for the relatively low Reynolds number flow considered here. A Crank Nicholson type 
scheme was not applied to the linear part of the equation because the stiffness of the system is not dominated by the 
diffusion term at the level of discretization used here. 
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Appendix B: Solving for 







Both Algorithms |3] and |4] require solving an optimization problem at least once at each observation time. As both 
problems are similar we will focus on the minimization of functionals of the form 

iHvi-'^))-' - b{^{s))) \'ds + g{^{T)). 

Note that the minimizer of this action solves the ordinary differential equation 

if = b + (7U 

where u G L^([0,T]) minimizes 

^ ^\u{s)\^ds + g{^^{T)). 

Because it tends to be well scaled, it is this cost function (after time discretization) that we choose to optimize. The 
discretization corresponding to the discrete version of the process described in the last appendix is simply 

T/A-l ^ 

(34) J2 7,Hn)\'^ + g{MT)) 



where now 



and 



ipuiin + 1)A) = ipu{nA) + ^{b{^u) + 6(v3„(nA))) + ^cru(n) 



In the minimization of p4\ we use the conjugate gradient method (thereby avoiding the formation of any Jacobian 
matrices) within a trust region framework (see in particular the algorithms on pages 69 and 171 in E4\ ). A disad- 
vantage of working in the u variables is that the evaluation of (pu (T) is a serial operation making the procedure less 
amenable to parallel implementation. Finally note that in Algorithm[3]requires repeated minimization of function of 
the form ( (34] i during the generation of each sample trajectory. This can be accelerated by the use of a continuation 
strategy, using the result of the previous optimization to initiate the next. 
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